This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[MXNET-614] Adding Synchronized Batch Normalization #11502

Merged
merged 38 commits into apache:master on Jul 14, 2018

Conversation

zhanghang1989
Contributor

@zhanghang1989 zhanghang1989 commented Jun 29, 2018

Description

Adding Synchronized Batch Normalization
Thanks @eric-haibin-lin for great help!

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@zhanghang1989 zhanghang1989 requested a review from szha as a code owner June 29, 2018 23:43
@zhanghang1989 zhanghang1989 changed the title [MXNET-614] Adding Synchronized Batch Normalization [MXNET-614] [WIP] Adding Synchronized Batch Normalization Jun 30, 2018
@zhanghang1989
Contributor Author

Help Wanted for passing the CI Test!!

@zhanghang1989 zhanghang1989 changed the title [MXNET-614] [WIP] Adding Synchronized Batch Normalization [MXNET-614] [Help Wanted for CI Test] Adding Synchronized Batch Normalization Jun 30, 2018
@zhanghang1989 zhanghang1989 changed the title [MXNET-614] [Help Wanted for CI Test] Adding Synchronized Batch Normalization [MXNET-614] Adding Synchronized Batch Normalization Jul 2, 2018
'ndev': num_devices, 'key': self.prefix}

def _get_num_devices(self):
# Caution: if not using all the GPUs, please manually set num_devices
Member

add the warning to docstring rather than showing a comment here
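For illustration, a minimal sketch of what moving the caution into the docstring could look like, with `_get_num_devices` defaulting to all visible GPUs. The class shell and method body here are assumptions for illustration only, not the merged implementation:

```python
import mxnet as mx
from mxnet.gluon import nn

class SyncBatchNorm(nn.BatchNorm):
    """Cross-GPU Synchronized Batch Normalization.

    Caution: if not all GPUs are used, please set `num_devices` manually.
    """
    def _get_num_devices(self):
        # Best-effort default: assume every visible GPU participates;
        # fall back to a single device when no GPU is present.
        num_devices = mx.context.num_gpus()
        return num_devices if num_devices > 0 else 1
```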

#include <dmlc/logging.h>
#include <dmlc/parameter.h>
#include <mxnet/operator.h>
# include <condition_variable>
Member

space between # and include?

template<class T>
class SharedND {
private:
int nDev;
Member

convention for variables is xxx_ for private members

Member

and camel for functions, which is correct right now

std::lock_guard<std::mutex> lock(mutex_);
auto it = registry_.find(key);
if (it != registry_.end()) return it->second;
T *newT = new T(ndev);
Member

memory pointed to by these raw pointers is not released

@zhanghang1989
Contributor Author

Thanks @RogerChern! The comments on the destructor function are really helpful.

@zhanghang1989
Contributor Author

Finally passed the CI test. Please take a look and let me know if you have further comments. @zhreshold @eric-haibin-lin @piiswrong. Thanks!

Docs are deployed here http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-11502/31/api/python/gluon/contrib.html?highlight=syncbatchnorm#mxnet.gluon.contrib.nn.SyncBatchNorm.

Member

@eric-haibin-lin eric-haibin-lin left a comment

some minor suggestions

_assert_tensor_close(_find_bn(bn1).running_var.data(ctx_list[0]),
_find_bn(bn2).running_var.data(ctx_list[0]))
input2grad = mx.nd.concat(*[output.grad.as_in_context(input.context) for output in inputs2], dim=0)
#print('input1.grad', input1.grad)
Member

Remove unused code

Contributor Author

Yeah, Will do. Thx

_assert_tensor_close(input1.grad, input2grad)

def test_sync_batchnorm():
def get_num_devices():
Member

There's test_utils.list_gpus()

Contributor Author

That is slightly different. list_gpus() doesn’t consider CUDA_VISIBLE_DEVICES
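For context, a sketch of a helper that respects CUDA_VISIBLE_DEVICES instead of relying on mx.test_utils.list_gpus() alone. The helper name mirrors the test snippet above; the body is illustrative, not the committed test code:

```python
import os
import mxnet as mx

def get_num_devices():
    # Honor CUDA_VISIBLE_DEVICES if set; otherwise count every GPU on the host.
    visible = os.environ.get('CUDA_VISIBLE_DEVICES')
    if visible is not None:
        return len([d for d in visible.split(',') if d.strip()])
    return len(mx.test_utils.list_gpus())
```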

@@ -1909,6 +1909,91 @@ def test_context_num_gpus():
# Test that num_gpus reports at least one GPU, as the test is run on a GPU host.
assert mx.context.num_gpus() > 0

def _check_batchnorm_result(input, num_devices=1, cuda=False):
from mxnet.gluon.utils import split_and_load
def _assert_tensor_close(a, b, atol=1e-3, rtol=1e-3):
Member

will assert_almost_equal do?
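For reference, the reviewer's suggestion amounts to using the existing mx.test_utils.assert_almost_equal helper with the same tolerances. A self-contained sketch, where the tensors are placeholders for the running statistics compared above:

```python
import mxnet as mx
from mxnet.test_utils import assert_almost_equal

a = mx.nd.ones((2, 3))  # placeholder for _find_bn(bn1).running_var.data(...)
b = mx.nd.ones((2, 3))  # placeholder for _find_bn(bn2).running_var.data(...)
# Equivalent to the local _assert_tensor_close(a, b) above.
assert_almost_equal(a.asnumpy(), b.asnumpy(), atol=1e-3, rtol=1e-3)
```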

}

~SharedND() {
mshadow::FreeSpace(&mean_);
Member

check for data_inited_ before freeing memory

Contributor Author

I Agree. Will make the changes. Thx

}
}

T* Retrieve(mshadow::Shape<1> shape, int index) {
Member

need doc for these member functions

~GlobalShared() {
for (auto it = registry_.begin(); it != registry_.end(); it++) {
T *ptr = it->second;
delete ptr;
Member

again, you have to guarantee that you delete valid pointers, since you didn't init them in the constructor but in a public function

Contributor Author

If not inited, the map should be empty

}
~GlobalSharedRank() {
for (auto it = registry_.begin(); it != registry_.end(); it++) {
T *ptr = it->second;
Member

same here

Contributor Author

If not inited, the hash map should be empty

Member

ok, should be fine

mshadow::Shape2(5, mean.shape_[0]), s);
Tensor<xpu, 1> gmean = workspace[0];
Tensor<xpu, 1> gvar = workspace[1];
// Tensor<xpu, 1> tmp = workspace[2];
Member

remove unused

@zhreshold
Member

Comments added. The rest LGTM now.

@eric-haibin-lin eric-haibin-lin merged commit 3ae4331 into apache:master Jul 14, 2018
@eric-haibin-lin
Member

@indhub FYI

@miteshyh

The SyncBatchNorm class doesn't seem to be available from the mxnet-cu91 nightly. It's visible for the regular mxnet nightly. Are these changes merged fully?

@eric-haibin-lin
Member

@miteshyh mxnet-cu91 is the stable release. SyncBatchNorm will only appear in the nightly distribution installed via --pre

@szha
Member

szha commented Jul 17, 2018

@miteshyh would you be able to update and use cu92? I heard from @bhavinthaker that nvidia discontinued support for cu91 so we intend to do the same.

@miteshyh

Thanks @szha, I downgraded to cu90 as cu92 doesn't have clean support on my hardware yet, and it works.

However, while training ADE20K with GluonCV I get "socket.error: [Errno 111] Connection refused" after a few iterations (at iteration 551); I have raised a separate issue for this. It happens with and without SyncBatchNorm.

dmlc/gluon-cv#215

XinYao1994 pushed a commit to XinYao1994/incubator-mxnet that referenced this pull request Aug 29, 2018
* sync batch norm

* global rank and barrier

* lint

* cpplint

* pylint

* doc

* add ref

* customized barrier

* cpplint

* get rid of pthread

* address comments

* warning

* pylint

* gpu unitest

* gpu 0

* mv to cpu test

* Revert "mv to cpu test"

This reverts commit 24543c9.

* ndev = 2

* debuging

* sum prod

* lint

* contrib, ngpu

* code style

* code style

* forward backward

* test

* cpu test

* fix deconstruction

* doc indent

* doc

* doc

* address comments

* typo

* asnumpy
@jianchao-li

jianchao-li commented Sep 24, 2018

Setting Rank and Barrier in forward and backward as separate variables won't resolve the deadlock issue. I suggest we instead postfix their key parameter with "forward" and "backward".

Hello, @RogerChern. I also met a deadlock issue while training PSPNet on gluon-cv. For the "key parameter" you mentioned above, do you mean the one in this line? Could you please share more details about the fix? Thank you.

@zhanghang1989
Contributor Author

Please set ndev to the number of GPUs used. In GluonCV, pass the parameter --ngpus 4 if you are using 4 GPUs.
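For later readers, a hedged usage sketch of the layer this PR adds, with num_devices set explicitly as advised above. The layer sizes and network shape are arbitrary examples; the SyncBatchNorm API follows the contrib docs linked earlier in this thread:

```python
import mxnet as mx
from mxnet.gluon import nn
from mxnet.gluon.contrib.nn import SyncBatchNorm

num_gpus = 4  # set this to the number of GPUs you actually train on
ctx_list = [mx.gpu(i) for i in range(num_gpus)]

net = nn.HybridSequential()
net.add(nn.Conv2D(64, kernel_size=3, padding=1),
        SyncBatchNorm(num_devices=num_gpus),  # sync statistics across all GPUs
        nn.Activation('relu'))
net.initialize(ctx=ctx_list)
```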

@jianchao-li

jianchao-li commented Sep 24, 2018

Hello, @zhanghang1989. Thank you for your reply. I will try it tomorrow morning and update you with the result.

Update

Hello, @zhanghang1989. I am not quite sure whether you were suggesting that I explicitly set --ngpus 4. I have only 4 GPUs on the machine, and the default value of ngpus is len(mx.test_utils.list_gpus()), which returned 4 in my case. The output of print(args) also confirmed this.

@pengwangucla

Hi Hang, I used your sync_bn implementation with the MXNet symbol API. However, it reduced the performance of my network. I wonder whether you have ever tried your sync_bn with the symbol API rather than Gluon. Thanks

@zhanghang1989
Contributor Author

Asked here #8458 (comment)

@ngunauj

ngunauj commented Jul 26, 2019

How to use it?

@zhanghang1989
Contributor Author

How to use it?

#11502 (comment)


9 participants